Author: Kamila Makhambetova
Which factors among region, GDP per capita, social support, healthy life expectancy, freedom to make life choices, generosity, perceptions of corruption, and ladder score in dystopia most strongly affect the ladder score (happiness score) in 2021? Is the estimated regression model relating the ladder score (happiness score) to the 8 factors stated above appropriate for this application? How does it look?
2020 and 2021 have been among the hardest years for humankind in the 21st century. Because of COVID-19 we faced many restrictions, and it is clear that people became unhappier. That is why I decided to investigate the happiness score and the factors that affected it in 2021. I chose 50 countries from different parts of the world, with a wide range of happiness scores, to make my data broad and unbiased.
Data was taken from World Happiness Report 2021: https://databank.worldbank.org . The World Happiness Report is a publication of the Sustainable Development Solutions Network, powered by data from the Gallup World Poll and Lloyd’s Register Foundation, which provided access to the World Risk Poll. The 2021 Report includes data from the ICL-YouGov Behaviour Tracker as part of the COVID Data Hub from the Institute of Global Health Innovation. The World Happiness Report was written by a group of independent experts acting in their personal capacities.
All 9 variables were taken from World Happiness Report 2021:
1) Happiness score, or subjective well-being (variable name ladder), was taken from the Feb 26, 2021 release of the Gallup World Poll (GWP), covering the years 2005 to 2020.
2) GDP per capita was taken from the October 14, 2020 update of the World Development Indicators (WDI).
3) Healthy Life Expectancy (HLE) at birth is based on the data extracted from the World Health Organization’s (WHO) Global Health Observatory data repository (Last updated: 2020-09-28).
4) Social support, freedom to make life choices, generosity, and corruption perception are national averages of binary responses (either 0 or 1) to GWP questions.
5) Ladder score in dystopia was calculated by the creators of the World Happiness Report 2021.
Happiness score (ladder score) is the national average response to a life-evaluation question: residents are asked to rate their own lives on a scale from 0 (the worst possible life) to 10 (the best possible life).
GDP per capita (PPP based) is gross domestic product converted to international dollars using purchasing power parity rates and divided by total population.
Healthy Life Expectancy is the average number of years that a person can expect to live in “full health” by taking into account years lived in less than full health due to disease and/or injury.
Social support is the national average of the binary responses (either 0 or 1) to the GWP question “If you were in trouble, do you have relatives or friends you can count on to help you whenever you need them, or not?”
Freedom to make life choices is the national average of responses to the GWP question “Are you satisfied or dissatisfied with your freedom to choose what you do with your life?”
Generosity is the residual of regressing the national average of response to the GWP question “Have you donated money to a charity in the past month?” on GDP per capita.
Corruption Perception is the national average of the survey responses to two questions in the GWP: “Is corruption widespread throughout the government or not?” and “Is corruption widespread within businesses or not?” The overall perception is just the average of the two 0-or-1 responses.
Ladder score in dystopia is the lowest national average for each key variable and is, along with the residual error, used as a regression benchmark.
Response variable (Y): Happiness score (ladder score) in range [0,10]
Explanatory variables (\(x_i\)): GDP per capita (PPP based), region, Healthy life expectancy, social support, freedom to make life choices, Generosity, Corruption Perception, Ladder score in dystopia.
Happiness score has a positive relationship with each of the explanatory variables GDP per capita (PPP based), Healthy Life Expectancy, Social support, Freedom to make life choices, and Generosity: a rise in any of these factors is associated with a rise in the happiness score. Happiness score and Corruption Perception have an inverse relationship: a rise in corruption perception is associated with a fall in the happiness score. Countries in the Western Europe region tend to have high happiness scores, while countries in Sub-Saharan Africa tend to have low ones.
An increase in GDP per capita means an increase in the country's production of goods and services; this drives economic growth, the quality of life rises, and people become happier. An increase in social support means an increase in the number of people who have relatives or friends they can count on in trouble. An increase in freedom to make life choices means more people are satisfied because they can lead their own lives without restrictions forced by government, religion, or society. An increase in generosity means more people participate in charity, perhaps because they earn more and have extra money to donate. An increase in healthy life expectancy may reflect an efficient government contribution to the health care system, i.e. the government spends a significant part of its budget on developing health care, so the quality of life rises. An increase in any of these factors raises the quality of life, and so the happiness score rises.
An increase in Corruption Perception means more residents think that corruption is widespread throughout government and business. They lose trust in the government, the judicial system, and governmental structures; the quality of life decreases, which leads to a fall in the happiness score.
Most Western European countries are developed countries: their economies are strong, and they have low corruption, a high level of democracy, developed infrastructure, and a good health care system, so the quality of life there is high. That is why countries in the Western Europe region have high happiness scores. Most Sub-Saharan African countries are developing countries: their economies are weak, and they have high corruption, a low level of democracy, and weak infrastructure and health care systems, so the quality of life there is low. That is why countries in Sub-Saharan Africa have low happiness scores.
It is difficult to judge the appropriateness of this regression model by just looking at the data table and without drawing proper graphs and conducting proper tests. So all conclusions about the linearity of Y, homogeneity of variance of residuals, independence of residuals, normality of residuals will be made in the ‘Data Analysis’ section.
1. Linear modeling: happiness score vs region, GDP per capita, social support, healthy life expectancy, freedom to make life choices, generosity, perceptions of corruption, ladder score in dystopia
2. Multicollinearity: Scatterplot Matrix, Correlation Matrix, VIF
3. Model Selection: F-test, Added Variable plots, Stepwise regression; model selection criteria: Adjusted \(R^2\), AIC, BIC, \(C_p\).
4. Residual analysis:
4.1) Independence of residuals: Sequence plot \(e_i\) vs \(i\) and Test for Independence
4.2) Normal distribution of residuals: QQ Plot, Shapiro-Wilk Test for Normality
4.3) Homogeneity of variance and linearity of \(\hat{Y}_{i}\): Residuals \(e_i\) vs Fitted values \(\hat{Y}_{i}\), Levene’s Test for Homogeneity of Variance
4.4) Outliers: Studentized deleted residuals, leverages, DFFITS, Cook’s distance.
4.5) Multicollinearity: VIF
4.6) Added variable plots
5. Remedial measures and diagnostics/ residual analysis of new model.
knitr::opts_chunk$set(echo = TRUE)
#install.packages('plyr', repos = "http://cran.us.r-project.org")
#install.packages('XQuartz', repos = "https://www.xquartz.org/")
#install.packages("readxl")
library("readxl")
library("plyr")
#library("XQuartz")
#install.packages("xlsx")
#library(xlsx)
My_Data <- read_excel("Data2.xlsx",sheet=1,range="B2:K52", col_names = TRUE)
hapscore<-My_Data$`Ladder score`
Gdp<-My_Data$`GDP per capita`
socsup<-My_Data$`Social support`
lifeexp<-My_Data$`Healthy life expectancy`
freed<-My_Data$`Freedom to make life choices`
generos<-My_Data$`Generosity`
corrupt<-My_Data$`Perceptions of corruption`
dystopia<-My_Data$`Ladder score in Dystopia +residual`
region<-My_Data$`Regional indicator`
My_Data
## # A tibble: 50 x 10
## Country `Regional indic… `Ladder score` `GDP per capita` `Social support`
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 Finland Western Europe 7.84 10.8 0.954
## 2 Denmark Western Europe 7.62 10.9 0.954
## 3 Switze… Western Europe 7.57 11.1 0.942
## 4 Iceland Western Europe 7.55 10.9 0.983
## 5 Nether… Western Europe 7.46 10.9 0.942
## 6 Norway Western Europe 7.39 11.1 0.954
## 7 Sweden Western Europe 7.36 10.9 0.934
## 8 Luxemb… Western Europe 7.32 11.6 0.908
## 9 Austria Western Europe 7.27 10.9 0.934
## 10 New Ze… North America a… 7.28 10.6 0.948
## # … with 40 more rows, and 5 more variables: `Healthy life expectancy` <dbl>,
## # `Freedom to make life choices` <dbl>, Generosity <dbl>, `Perceptions of
## # corruption` <dbl>, `Ladder score in Dystopia +residual` <dbl>
my_model=lm(hapscore~ Gdp+socsup+lifeexp+freed+generos+corrupt+dystopia+region, data=My_Data)
summary(my_model)
##
## Call:
## lm(formula = hapscore ~ Gdp + socsup + lifeexp + freed + generos +
## corrupt + dystopia + region, data = My_Data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.084598 -0.004443 0.000513 0.006929 0.044054
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.420514 0.149967 -29.477 <2e-16
## Gdp 0.330308 0.008476 38.968 <2e-16
## socsup 2.426241 0.075869 31.979 <2e-16
## lifeexp 0.029384 0.001367 21.496 <2e-16
## freed 1.304590 0.051786 25.192 <2e-16
## generos 0.598031 0.033647 17.774 <2e-16
## corrupt -0.626058 0.022294 -28.082 <2e-16
## dystopia 0.981058 0.014448 67.901 <2e-16
## regionCommonwealth of Independent States -0.019576 0.014593 -1.341 0.1884
## regionEast Asia -0.033902 0.014643 -2.315 0.0266
## regionLatin America and Caribbean -0.004396 0.013374 -0.329 0.7443
## regionMiddle East and North Africa 0.018483 0.014628 1.264 0.2147
## regionNorth America and ANZ 0.024950 0.017559 1.421 0.1642
## regionSub-Saharan Africa 0.003464 0.028897 0.120 0.9053
## regionWestern Europe 0.024608 0.013610 1.808 0.0792
##
## (Intercept) ***
## Gdp ***
## socsup ***
## lifeexp ***
## freed ***
## generos ***
## corrupt ***
## dystopia ***
## regionCommonwealth of Independent States
## regionEast Asia *
## regionLatin America and Caribbean
## regionMiddle East and North Africa
## regionNorth America and ANZ
## regionSub-Saharan Africa
## regionWestern Europe .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02035 on 35 degrees of freedom
## Multiple R-squared: 0.9995, Adjusted R-squared: 0.9993
## F-statistic: 4750 on 14 and 35 DF, p-value: < 2.2e-16
By the t-tests we reject \(H_0\): \(B_{i}=0\) for the intercept, GDP, social support, life expectancy, freedom to make life choices, generosity, perceptions of corruption, and ladder score in dystopia, as each p-value < \(2\times10^{-16}\) < 0.05. For the categorical variable region, only the East Asia category is significant at \(\alpha=0.05\), as p-value = 0.0266 < 0.05, so we reject \(H_0\): \(B_{9}=0\).
By the F-test, which tests whether all \(B_{i}=0\), we reject \(H_0\): \(B_{1}=B_{2}=...=B_{14}=0\), as p-value < \(2\times10^{-16}\) < \(\alpha=0.05\). So at least one predictor has a nonzero coefficient. The F-test for the full model and the t-tests for each predictor in the presence of the other predictors thus reach the same conclusion, that there are nonzero \(B\)’s, so I assume there is no strong multicollinearity between predictors.
So \(\hat{Y}_{i} = -4.42+0.330X_{i,1}+2.426X_{i,2}+0.0294X_{i,3}+1.305X_{i,4}+0.598X_{i,5}-0.626X_{i,6}+0.981X_{i,7}-0.0339X_{i,8}\), where \(X_{i,8}=0\) for countries that are not in East Asia and \(X_{i,8}=1\) for countries in the East Asia region.
To analyze the multicollinearity between the predictor variables, I decided to draw a scatterplot matrix and construct the correlation matrix.
My_Data2 <- read_excel("Data2.xlsx",sheet=1,range="D2:K52", col_names = TRUE)
pairs(My_Data2 , lower.panel = NULL)
cor(My_Data2 )
## Ladder score GDP per capita Social support
## Ladder score 1.0000000 0.76130086 0.7245603
## GDP per capita 0.7613009 1.00000000 0.6913561
## Social support 0.7245603 0.69135607 1.0000000
## Healthy life expectancy 0.6570834 0.78040574 0.6994376
## Freedom to make life choices 0.6025866 0.35556647 0.5145396
## Generosity 0.4252662 -0.02444444 0.2424689
## Perceptions of corruption -0.7345505 -0.49284577 -0.4568008
## Ladder score in Dystopia +residual 0.1956136 -0.29840614 -0.3286375
## Healthy life expectancy
## Ladder score 0.65708344
## GDP per capita 0.78040574
## Social support 0.69943759
## Healthy life expectancy 1.00000000
## Freedom to make life choices 0.32798082
## Generosity -0.04717927
## Perceptions of corruption -0.42376756
## Ladder score in Dystopia +residual -0.38503564
## Freedom to make life choices Generosity
## Ladder score 0.6025866 0.42526621
## GDP per capita 0.3555665 -0.02444444
## Social support 0.5145396 0.24246886
## Healthy life expectancy 0.3279808 -0.04717927
## Freedom to make life choices 1.0000000 0.42708377
## Generosity 0.4270838 1.00000000
## Perceptions of corruption -0.5706967 -0.35728279
## Ladder score in Dystopia +residual -0.1046547 0.32589787
## Perceptions of corruption
## Ladder score -0.73455054
## GDP per capita -0.49284577
## Social support -0.45680078
## Healthy life expectancy -0.42376756
## Freedom to make life choices -0.57069671
## Generosity -0.35728279
## Perceptions of corruption 1.00000000
## Ladder score in Dystopia +residual -0.03914784
## Ladder score in Dystopia +residual
## Ladder score 0.19561365
## GDP per capita -0.29840614
## Social support -0.32863749
## Healthy life expectancy -0.38503564
## Freedom to make life choices -0.10465470
## Generosity 0.32589787
## Perceptions of corruption -0.03914784
## Ladder score in Dystopia +residual 1.00000000
I will not drop any predictor variable based on the scatterplot matrix alone: it shows the marginal relationship between Y and each predictor, but gives no information about the joint relationship between Y (happiness score) and the 8 predictors together. Despite r = 0.196, a small correlation between Y and Ladder score in Dystopia, I will not drop this predictor: a weak marginal relationship does not mean the predictor is unnecessary in a model that includes the other predictors. Further tests are required.
We want to test whether we can drop q variables from a model that has p = k + 1 parameters (including the intercept), where p is the number of \(\beta\)’s and q < p.
\(H_0:\; B_{j,1}=B_{j,2}=...=B_{j,q}=0\) in the full model.
If \(H_0\) is true, the p-value for the test is Pr(\(F^*\; >F_{q,n-p}\)).
Use R to calculate the p-value.
I want to test whether Ladder score in Dystopia is necessary in the presence of the other predictors, since the correlation matrix shows only a weak positive relationship between Y (happiness score) and Ladder score in Dystopia (r = 0.196).
\(H_0:\; \beta_7=0\) in the full model
NOTE: For all tests I will use significance level \(\alpha=0.05\).
my_model=lm(hapscore~ Gdp+socsup+lifeexp+freed+generos+corrupt+dystopia+region, data=My_Data)
model_reduced= update(my_model, . ~ . - dystopia)
anova(my_model, model_reduced)
## Analysis of Variance Table
##
## Model 1: hapscore ~ Gdp + socsup + lifeexp + freed + generos + corrupt +
## dystopia + region
## Model 2: hapscore ~ Gdp + socsup + lifeexp + freed + generos + corrupt +
## region
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 35 0.0145
## 2 36 1.9246 -1 -1.9101 4610.5 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Reject \(H_0\): \(B_7\)=0, as p-value < \(2\times10^{-16}\) < \(\alpha\)=0.05. So Ladder score in Dystopia is necessary in the presence of the other predictors.
The predictor \(x_j\) has:
\(VIF_j=\frac{1}{1-R^{2}_j}\)
where \(R^{2}_j\) is the \(R^2\) from regressing \(x_j\) on the remaining predictors.
If \(VIF_j \approx 1\) then \(x_j\) is not involved in any multicollinearity.
If \(VIF_j >10\) then \(x_j\) is involved in severe multicollinearity.
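The definition can be verified by hand: regress one predictor on the others, take that regression’s \(R^2\), and plug it into the formula. A minimal sketch using R’s built-in mtcars data rather than the happiness data; car::vif() on the corresponding full model should report the same value for disp:

```r
# Manual VIF for one predictor (disp), using the built-in mtcars data:
# regress disp on the remaining predictors and apply 1 / (1 - R^2_j)
r2_disp  <- summary(lm(disp ~ hp + wt + drat, data = mtcars))$r.squared
vif_disp <- 1 / (1 - r2_disp)
vif_disp
```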
library(car)
## Loading required package: carData
vif(my_model)
## GVIF Df GVIF^(1/(2*Df))
## Gdp 7.143813 1 2.672791
## socsup 4.274933 1 2.067591
## lifeexp 5.971189 1 2.443602
## freed 2.479643 1 1.574688
## generos 2.233515 1 1.494495
## corrupt 2.687152 1 1.639253
## dystopia 2.855025 1 1.689682
## region 78.208205 7 1.365313
From the variance inflation factors we can see that the VIFs for all predictor variables except region are less than 10. So GDP, social support, life expectancy, freedom to make life choices, generosity, perceptions of corruption, and ladder score in dystopia are not involved in severe multicollinearity, but the GVIF of region is 78.2 > 10, so region is involved in severe multicollinearity. (Note that for a categorical predictor with 7 degrees of freedom the scale-adjusted measure \(GVIF^{1/(2Df)}=1.37\) is the more comparable quantity, so the raw GVIF should be read with caution.)
I decided to drop the categorical variable region, since by the t-tests only the East Asia category is significant: holding the other predictors constant, a country in East Asia has a happiness score 0.0339 units lower than a country outside East Asia, which in my opinion is a very small change. Also, region is involved in severe multicollinearity, so I can drop it.
Added variable plots are refined plots that help determine whether a “non-linear” pattern remains once the other variables are in the model.
They give an idea of the functional form of \(x_j\): a needed transformation of \(x_j\) should mimic the pattern seen in the plot.
#par(mfrow=c(4,2))
#Added var plot for x1
fit11 = lm(hapscore~ socsup+lifeexp+freed+generos+corrupt+dystopia, data=My_Data)
reYonithoutx1=fit11$residuals
fit12 = lm(Gdp ~ socsup+lifeexp+freed+generos+corrupt+dystopia, data=My_Data)
rex1onwithoutx1=fit12$residuals
plot(reYonithoutx1 ~ rex1onwithoutx1, main="Graph 1: Added-Variable Plot for x1 (GDP)", xlab="e(x1|others)", ylab="e(Y|others without x1) ")
#scatter.smooth(x=rex1onwithoutx1, y=reYonithoutx1, main="Graph 1: Added-Variable Plot for x1 (GDP)",xlab="e(x1|others)", ylab="e(Y|others without x1)")
#Added var plot for x2
fit21 = lm(hapscore~ Gdp+lifeexp+freed+generos+corrupt+dystopia, data=My_Data)
reYonithoutx2=fit21$residuals
fit22 = lm(socsup~ Gdp+lifeexp+freed+generos+corrupt+dystopia, data=My_Data)
rex2onwithoutx2=fit22$residuals
plot(reYonithoutx2 ~ rex2onwithoutx2, main="Graph 2: Added-Variable Plot for x2 (social support)", xlab="e(x2|others)", ylab="e(Y|others without x2) ")
#Added var plot for x3
fit31 = lm(hapscore~ Gdp+socsup+freed+generos+corrupt+dystopia, data=My_Data)
reYonithoutx3=fit31$residuals
fit32 = lm(lifeexp~ Gdp+socsup+freed+generos+corrupt+dystopia, data=My_Data)
rex3onwithoutx3=fit32$residuals
plot(reYonithoutx3 ~ rex3onwithoutx3, main="Graph 3: Added-Variable Plot for x3 (life expectancy)", xlab="e(x3|others)", ylab="e(Y|others without x3) ")
#Added var plot for x4
fit41 = lm(hapscore~ Gdp+socsup+lifeexp+generos+corrupt+dystopia, data=My_Data)
reYonithoutx4=fit41$residuals
fit42 = lm(freed~ Gdp+socsup+lifeexp+generos+corrupt+dystopia, data=My_Data)
rex4onwithoutx4=fit42$residuals
plot(reYonithoutx4 ~ rex4onwithoutx4, main="Graph 4: Added-Variable Plot for x4 (freedom to make life choice)", xlab="e(x4|others)", ylab="e(Y|others without x4) ")
#Added var plot for x5
fit51 = lm(hapscore~ Gdp+socsup+lifeexp+freed+corrupt+dystopia, data=My_Data)
reYonithoutx5=fit51$residuals
fit52 = lm(generos~ Gdp+socsup+lifeexp+freed+corrupt+dystopia, data=My_Data)
rex5onwithoutx5=fit52$residuals
plot(reYonithoutx5 ~ rex5onwithoutx5, main="Graph 5: Added-Variable Plot for x5 (generosity)", xlab="e(x5|others)", ylab="e(Y|others without x5) ")
#Added var plot for x6
fit61 = lm(hapscore~ Gdp+socsup+lifeexp+freed+generos+dystopia, data=My_Data)
reYonithoutx6=fit61$residuals
fit62 = lm(corrupt~ Gdp+socsup+lifeexp+freed+generos+dystopia, data=My_Data)
rex6onwithoutx6=fit62$residuals
plot(reYonithoutx6 ~ rex6onwithoutx6, main="Graph 6: Added-Variable Plot for x6 (Perceptions of corruption)", xlab="e(x6|others)", ylab="e(Y|others without x6) ")
#Added var plot for x7
fit71 = lm(hapscore~ Gdp+socsup+lifeexp+freed+generos+corrupt, data=My_Data)
reYonithoutx7=fit71$residuals
fit72 = lm(dystopia~ Gdp+socsup+lifeexp+freed+generos+corrupt, data=My_Data)
rex7onwithoutx7=fit72$residuals
plot(reYonithoutx7 ~ rex7onwithoutx7, main="Graph 7: Added-Variable Plot for x7 (Ladder score in Dystopia)", xlab="e(x7|others)", ylab="e(Y|others without x7) ")
All 7 added variable plots show a strong linear relationship between residuals(\(Y\)| \(x_{-j}\)) and residuals(\(x_{j}\)|\(x_{-j}\)). So every \(x_j\) explains residual variability once the rest of the predictors are in the model. Also, since the relationships are linear, I do not need to transform any \(x_j\), and I do not think I need to add polynomial terms.
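As a sanity check on the manual construction above: the least-squares slope in an added variable plot equals the coefficient of that predictor in the full model (the Frisch-Waugh-Lovell result). A minimal sketch using R’s built-in mtcars data rather than the happiness data:

```r
# The slope of the added variable plot for disp equals the full-model coefficient
full  <- lm(mpg ~ disp + hp + wt, data = mtcars)
e_y   <- residuals(lm(mpg  ~ hp + wt, data = mtcars))  # e(Y | others)
e_x   <- residuals(lm(disp ~ hp + wt, data = mtcars))  # e(disp | others)
slope <- coef(lm(e_y ~ e_x))[["e_x"]]
c(slope = slope, full_coef = coef(full)[["disp"]])
```

The same plots can also be produced directly with car::avPlots() on a fitted lm object, instead of fitting each pair of auxiliary regressions by hand.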
Stepwise regression is based on 2 methods: forward selection, in which we add variables to the model, and backward elimination, in which we remove predictors from the model.
I added all possible predictors to the model and chose entry threshold \(\alpha_e=0.1\) and stay threshold \(\alpha_s=0.15\).
There are 21 possible interaction terms, as 7C2 = 21, and 7 second-order terms.
In total my model has 35 predictors.
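The count of candidate terms can be checked in R:

```r
# 7 first-order terms + 7 quadratic terms + choose(7, 2) pairwise interactions
n_terms <- 7 + 7 + choose(7, 2)
n_terms  # 35
```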
library(olsrr)
##
## Attaching package: 'olsrr'
## The following object is masked from 'package:datasets':
##
## rivers
My_Data <- read_excel("Data2.xlsx",sheet=1,range="B2:K52", col_names = TRUE)
My_Data$GDP2=My_Data$`GDP per capita`^2
My_Data$socsup2=My_Data$`Social support`^2
My_Data$lifeexp2=My_Data$`Healthy life expectancy`^2
My_Data$freed2=My_Data$`Freedom to make life choices`^2
My_Data$generos2=My_Data$Generosity^2
My_Data$corrupt2=My_Data$`Perceptions of corruption`^2
My_Data$dystopia2=My_Data$`Ladder score in Dystopia +residual`^2
m2 = update(my_model, . ~ . + GDP2 + socsup2+lifeexp2+freed2+generos2+corrupt2+dystopia2)
m3 = update(m2,. ~ . + Gdp*socsup+Gdp*lifeexp+Gdp*freed+Gdp*generos+Gdp*corrupt+Gdp*dystopia)
m4 = update(m3,. ~ . + socsup*lifeexp+socsup*freed+socsup*generos+socsup*corrupt+socsup*dystopia)
m5 = update(m4,. ~ . + lifeexp*freed+lifeexp*generos+lifeexp*corrupt+lifeexp*dystopia)
m6 = update(m5,. ~ . + freed*generos+freed*corrupt+freed*dystopia)
m7 = update(m6,. ~ . + generos*corrupt+generos*dystopia+corrupt*dystopia)
ols_step_both_p(m7, pent = 0.1, prem = 0.15)
##
## Stepwise Selection Summary
## --------------------------------------------------------------------------------------------------
## Added/ Adj.
## Step Variable Removed R-Square R-Square C(p) AIC RMSE
## --------------------------------------------------------------------------------------------------
## 1 Gdp:freed addition 0.686 0.680 89704.7240 60.1247 0.4243
## 2 dystopia addition 0.828 0.821 49218.9360 32.1315 0.3177
## 3 generos2 addition 0.829 0.818 48943.1970 33.8488 0.3202
## 4 dystopia removal 0.706 0.694 84077.5450 58.8860 0.4151
## 5 dystopia2 addition 0.816 0.804 52594.5740 37.4435 0.3319
## 6 Gdp:freed removal 0.040 0.000 274631.7410 118.0525 0.7502
## 7 Gdp:socsup addition 0.907 0.901 26550.3170 3.3040 0.2359
## 8 generos2 removal 0.907 0.903 26619.8270 1.4383 0.2337
## 9 freed2 addition 0.955 0.952 12855.8020 -32.8743 0.1643
## 10 corrupt2 addition 0.972 0.969 8014.7430 -54.4141 0.1313
## 11 generos addition 0.975 0.972 7096.5840 -58.4794 0.1250
## 12 lifeexp addition 0.995 0.994 1438.7830 -135.3015 0.0575
## 13 corrupt addition 0.995 0.994 1329.7000 -137.2170 0.0559
## 14 freed2 removal 0.983 0.981 4790.4410 -76.0216 0.1040
## 15 freed addition 0.995 0.994 1334.7140 -137.0335 0.0560
## 16 Gdp:socsup removal 0.881 0.864 34063.6790 21.7373 0.2763
## 17 Gdp:dystopia addition 0.984 0.981 4672.8270 -75.2764 0.1039
## 18 corrupt2 removal 0.983 0.981 4784.6380 -76.0818 0.1039
## 19 socsup2 addition 0.998 0.998 541.5820 -180.3458 0.0363
## 20 socsup addition 0.998 0.998 464.2370 -185.7622 0.0341
## 21 lifeexp2 addition 0.998 0.998 463.5230 -184.0364 0.0345
## 22 socsup removal 0.998 0.998 509.4890 -181.3988 0.0357
## 23 region addition 0.999 0.998 277.8100 -195.6407 0.0295
## 24 socsup2 removal 0.986 0.980 4042.3100 -68.4920 0.1059
## 25 socsup:lifeexp addition 0.999 0.998 298.5910 -192.3742 0.0305
## 26 Gdp:dystopia removal 0.976 0.967 6753.2910 -42.9892 0.1366
## 27 Gdp:lifeexp addition 0.999 0.999 240.3300 -202.1326 0.0277
## 28 lifeexp:dystopia addition 0.999 0.999 163.6070 -217.3425 0.0236
## 29 Gdp addition 0.999 0.999 149.5130 -219.7291 0.0230
## 30 lifeexp:dystopia removal 0.999 0.999 241.0210 -200.3755 0.0280
## 31 lifeexp:corrupt addition 0.999 0.999 166.5010 -215.1098 0.0241
## 32 Gdp removal 0.999 0.999 186.7670 -211.6371 0.0250
## 33 GDP2 addition 0.999 0.999 184.1090 -210.7336 0.0251
## 34 lifeexp:corrupt removal 0.999 0.999 241.9640 -200.2004 0.0281
## 35 socsup:dystopia addition 0.999 0.999 178.1240 -212.1785 0.0248
## 36 lifeexp removal 0.999 0.999 194.3530 -209.9016 0.0255
## 37 generos:dystopia addition 0.999 0.999 161.4690 -216.4342 0.0238
## 38 socsup:lifeexp removal 0.999 0.998 300.5140 -190.3860 0.0310
## 39 socsup:corrupt addition 0.999 0.999 209.7390 -204.9787 0.0266
## 40 Gdp:lifeexp removal 0.999 0.999 223.0500 -203.8323 0.0271
## 41 freed:corrupt addition 0.999 0.999 187.6980 -209.8867 0.0254
## 42 generos:dystopia removal 0.999 0.999 198.5850 -208.9589 0.0257
## 43 corrupt:dystopia addition 0.999 0.999 175.3200 -212.8701 0.0246
## --------------------------------------------------------------------------------------------------
I added all 21 possible interaction terms (7C2) and all 7 possible quadratic terms, so in total I had 35 predictors in my model.
By stepwise selection the final model has 10 predictor variables:
1) 4 first-order variables: Generosity, Corruption, Freedom to make life choices, GDP. Three first-order predictors (Social support, Healthy life expectancy, Ladder score in Dystopia) were deleted.
2) 3 second-order terms: \(Ladder\; score\; in\; Dystopia^2\), \(Life\; expectancy^2\), \(GDP^2\).
3) 3 interaction terms: \(social\;support\times Ladder\; score\; in\; Dystopia\), \(social\; support\times corruption\), \(freedom\times corruption\).
I decided to keep all these predictor variables and to add back the 3 removed first-order predictors (Social support, Healthy life expectancy, Ladder score in Dystopia), since by the added variable plots these predictors explain residual variability once the rest of the predictors are in the model.
1) Adjusted \(R^2\)
The adjusted \(R^2\) provides a measure of how well the model will predict data not used to build it.
When irrelevant variables are added, the adjusted \(R^2\) decreases.
The bigger the adjusted \(R^2\), the better the model.
2) Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC)
AIC favors models with small SSE but penalizes models with too many parameters p.
The lower the AIC, the better the model.
BIC is similar to AIC, but its penalty term is more severe.
The lower the BIC, the better the model.
3) Mallows’s \(C_p\)
\(C_p\) assesses the bias of the model and its fit to the data.
If \(C_p \approx p\), then the reduced model predicts as well as the full model.
If \(C_p < p\), then the reduced model is estimated to be less biased than the full model.
The lower the \(C_p\), the better the model.
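For reference, one common form of these criteria, with n observations, p parameters, and \(SSE_p\) the residual sum of squares of a candidate model (texts differ by additive constants):

\(AIC_p = n\ln(SSE_p/n) + 2p\)

\(BIC_p = n\ln(SSE_p/n) + p\ln(n)\)

\(C_p = \frac{SSE_p}{MSE_{full}} - (n - 2p)\), where \(MSE_{full}\) is the mean squared error of the full model.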
#m12 = update(my_model, . ~ . -region+ GDP2 + lifeexp2+dystopia2)
#m13= update(m12, . ~ . +socsup*dystopia +socsup*corrupt+ freed*corrupt)
library(leaps)
allhapscore <- regsubsets( hapscore~ Gdp+socsup+lifeexp+freed+generos+corrupt+dystopia+GDP2 + lifeexp2+dystopia2+socsup*dystopia +socsup*corrupt+ freed*corrupt, nbest=4, data=My_Data)
all_output <- summary(allhapscore)
with(all_output, round(cbind(which, rsq, adjr2, cp, bic), 3))
## (Intercept) Gdp socsup lifeexp freed generos corrupt dystopia GDP2 lifeexp2
## 1 1 0 0 0 0 0 0 0 1 0
## 1 1 1 0 0 0 0 0 0 0 0
## 1 1 0 0 0 0 0 0 0 0 0
## 1 1 0 0 0 0 0 1 0 0 0
## 2 1 0 0 0 0 0 0 0 1 0
## 2 1 1 0 0 0 0 0 0 0 0
## 2 1 0 0 0 0 0 0 0 0 1
## 2 1 0 0 1 0 0 0 0 0 0
## 3 1 1 0 0 0 0 1 0 0 0
## 3 1 0 0 0 0 0 1 0 1 0
## 3 1 1 0 0 0 0 0 0 0 0
## 3 1 0 0 0 0 0 0 0 1 0
## 4 1 0 0 0 1 0 0 0 1 0
## 4 1 0 0 0 1 0 1 0 1 0
## 4 1 1 0 0 1 0 0 0 0 0
## 4 1 1 0 0 1 0 1 0 0 0
## 5 1 0 0 0 1 0 0 0 1 1
## 5 1 0 0 1 1 0 0 0 1 0
## 5 1 1 0 0 1 0 0 0 0 1
## 5 1 1 0 1 1 0 0 0 0 0
## 6 1 0 0 0 1 1 0 0 1 1
## 6 1 0 0 1 1 1 0 0 1 0
## 6 1 1 0 0 1 1 0 0 0 1
## 6 1 1 0 1 1 1 0 0 0 0
## 7 1 1 1 1 1 1 0 1 0 0
## 7 1 1 1 1 1 1 1 1 0 0
## 7 1 1 1 1 1 1 0 1 0 0
## 7 1 0 1 1 1 1 0 1 1 0
## 8 1 1 1 1 1 1 1 1 0 0
## 8 1 1 1 1 1 1 0 1 0 0
## 8 1 1 1 1 1 1 0 1 1 0
## 8 1 1 1 1 1 1 0 1 0 0
## dystopia2 socsup:dystopia socsup:corrupt freed:corrupt rsq adjr2 cp
## 1 0 0 0 0 0.597 0.589 20447.298
## 1 0 0 0 0 0.580 0.571 21332.363
## 1 0 1 0 0 0.569 0.560 21869.964
## 1 0 0 0 0 0.540 0.530 23367.097
## 2 0 1 0 0 0.909 0.905 4608.316
## 2 0 1 0 0 0.904 0.900 4818.740
## 2 0 1 0 0 0.831 0.824 8534.126
## 2 0 1 0 0 0.826 0.819 8802.789
## 3 0 1 0 0 0.967 0.965 1623.003
## 3 0 1 0 0 0.967 0.965 1625.432
## 3 0 1 1 0 0.960 0.958 1979.901
## 3 0 1 1 0 0.959 0.957 2029.521
## 4 0 1 0 1 0.984 0.983 759.372
## 4 0 1 0 0 0.984 0.983 766.399
## 4 0 1 0 1 0.984 0.982 788.145
## 4 0 1 0 0 0.984 0.982 792.267
## 5 0 1 1 0 0.992 0.991 372.636
## 5 0 1 1 0 0.992 0.991 386.580
## 5 0 1 1 0 0.991 0.990 411.214
## 5 0 1 1 0 0.991 0.990 426.535
## 6 0 1 1 0 0.997 0.997 96.523
## 6 0 1 1 0 0.997 0.997 107.272
## 6 0 1 1 0 0.997 0.997 112.795
## 6 0 1 1 0 0.997 0.996 124.956
## 7 0 0 0 1 0.999 0.999 4.992
## 7 0 0 0 0 0.999 0.999 5.007
## 7 0 0 1 0 0.999 0.999 6.882
## 7 0 0 0 1 0.999 0.999 8.375
## 8 0 0 0 1 0.999 0.999 5.891
## 8 1 0 0 1 0.999 0.999 6.188
## 8 0 0 0 1 0.999 0.999 6.268
## 8 0 1 0 1 0.999 0.999 6.338
## bic
## 1 -37.615
## 1 -35.501
## 1 -34.259
## 1 -30.955
## 2 -107.840
## 2 -105.628
## 2 -77.247
## 2 -75.705
## 3 -155.304
## 3 -155.232
## 3 -145.594
## 3 -144.382
## 4 -188.080
## 4 -187.643
## 4 -186.312
## 4 -186.064
## 5 -217.474
## 5 -215.805
## 5 -212.985
## 5 -211.308
## 6 -270.110
## 6 -266.210
## 6 -264.319
## 6 -260.391
## 7 -327.367
## 7 -327.348
## 7 -325.001
## 7 -323.208
## 8 -324.888
## 8 -324.498
## 8 -324.392
## 8 -324.301
By the model selection criteria the best model has the lowest BIC = -327.367, the lowest \(C_p\) = 4.992, and the highest adjusted \(R^2\) = 0.999. This model has 7 predictor terms: GDP, Social support, Healthy life expectancy, Freedom to make life choices, Generosity, Ladder score in Dystopia, and one interaction term, freedom*corruption.
By looking at the model selection criteria, constructing different plots, and applying the stepwise selection method, I decided to use the linear model of Y (happiness score) on 8 predictor terms: GDP, Social support, Healthy life expectancy, Freedom to make life choices, Generosity, Perceptions of corruption, Ladder score in Dystopia, and freedom*corruption.
Assumptions of MLR:
\(\epsilon_1,\epsilon_2,\ldots,\epsilon_n \overset{iid}{\sim} N(0,\sigma^2)\)
A linear relationship between E[Y] and associated predictors \(x_1\), . . . , \(x_k\) .
If the model is appropriate for the given data, the residuals \(e_i\) should reflect the assumed properties for the error terms \(\epsilon_i\).
Plotting \(e_i\) against \(i\), i.e. the residuals versus their index, allows us to detect correlation between error terms, and hence to check the independence of the residuals.
library(lawstat)
##
## Attaching package: 'lawstat'
## The following object is masked from 'package:car':
##
## levene.test
library(latex2exp)
my_model_fin=lm(hapscore~ Gdp+socsup+lifeexp+freed+generos+corrupt+dystopia+corrupt*freed, data=My_Data)
re = my_model_fin$residuals
plot(re, main="Graph 3: Time Series Plot of the Residuals", xlab=TeX("i"), ylab=TeX("e_i"))
lines(re)
abline(h=0)
\(H_0:\; e_i\; are\; independent\; i=1,2,...,n\)
\(H_a:\; e_i\; are\; not\; independent\; i=1,2,...,n\)
R is used to calculate the p.m.f. of the runs statistic \(U\) and the resulting p-value:
\(p-value=\Pr(U\leq u)\)
NOTE: For all tests I will use a 5% significance level (\(\alpha=0.05\), corresponding to 95% confidence).
library(lawstat)
runs.test(re, plot.it=TRUE)
##
## Runs Test - Two sided
##
## data: re
## Standardized Runs Statistic = 0, p-value = 1
Here A denotes the non-negative residuals (\(e_i\geq0\)), B the negative residuals (\(e_i<0\)), and u = 25. The p-value is 1, i.e., 100%.
According to Graph 3, the residuals are independent: there is no repeated pattern, and I do not see a clear relationship between \(e_i\) and \(i\). No \(e_j\) can be predicted from the previous residuals, so there is no evidence of dependence.
Also, the Runs Test for Independence failed to reject \(H_0:\; e_i\; are\; independent,\; i=1,2,...,n\), since the p-value of the two-sided Runs Test, 1 (100%), exceeds \(\frac{\alpha}{2}=0.025\).
Indeed, a p-value of 100% exceeds any commonly used \(\frac{\alpha}{2}\): typically \(1\% \leq\alpha\leq10\%\), so \(0.5\% \leq\frac{\alpha}{2}\leq5\%\). Hence we retain \(H_0:\; e_i\; are\; independent,\; i=1,2,...,n\). Graph 3 and the Runs Test for Independence therefore lead to the same conclusion: the residuals are independent.
To check the normality of residuals we can construct Normal Q-Q plot of the residuals. Here each residual is plotted against its expected value under normality.
Points located near/ on a straight line suggest that these residuals are normally distributed, whereas points that depart from a straight line suggest that these residuals are not normally distributed.
qqnorm(re)
qqline(re)
hist(re, xlab='residuals', main='Graph 4: Histogram of residuals', freq=FALSE)
curve(dnorm(x, mean(re), sd(re)), add=TRUE)
From the Normal Q-Q plot it is difficult to tell whether the residuals are normally distributed. Residuals with theoretical quantiles between -1 and 1 lie on or very close to the reference line, so they look normal, whereas residuals with theoretical quantiles beyond \(\pm1\) depart increasingly far from the line.
According to Graph 4, the residuals look approximately normally distributed, as the histogram roughly follows the shape of a normal curve; the distribution is slightly left-skewed, with a longer left tail.
Since the two plots disagree, I conducted the Shapiro-Wilk test to decide formally whether the residuals are normally distributed.
\(H_0: e_1,e_2,...,e_n\sim N\)
\(H_a: e_1,e_2,...,e_n\nsim N\)
The test statistic W and the p-value are calculated in R.
shapiro.test(re)
##
## Shapiro-Wilk normality test
##
## data: re
## W = 0.61696, p-value = 3.507e-10
By the Shapiro-Wilk Test for Normality, W = 0.61696 and p-value = \(3.507\times10^{-10}\), approximately 0%. At significance level \(\alpha=0.05\) (5%), \(p\text{-}value=3.507\times10^{-10}<\alpha\), so we reject \(H_0: e_1,e_2,...,e_n\sim N\) in favor of \(H_a\). The residuals are not normally distributed.
To check the linearity and homogeneity of Variance assumptions, we can plot Residuals \(e_i\) vs Fitted values \(\hat Y_i\) graph. If the variance of residuals is homogeneous, we expect to see a constant spread/distance of the residuals to the 0 line across all \(\hat Y_i\) values. If the linear model is a good fit, we expect to see the residuals evenly spread on either side of the 0 line.
fits = my_model_fin$fitted.values
plot(re ~ fits,ylim=c(-0.05,0.05), main="Graph 5: Residuals vs Fitted values", xlab=TeX("\\hat{Y}_i "),
ylab=TeX("e_i "))
#scatter.smooth(re ~ fits, main="Residuals vs Fitted values", xlab=TeX("\\hat{Y}_i "), ylab=TeX("e_i"))
abline(h=0)
Firstly, according to Graph 5, there may be a parabolic relationship between \(e_i\) and \(\hat Y_i\), since the residuals are not evenly spread around the y=0 line; this suggests a nonlinear relation between Y and \(x_1,x_2,...,x_8\). Secondly, the residuals neither fan out nor contract as \(\hat Y_i\) increases, so the variance of the residuals is homogeneous.
Split responses (happiness score) into t distinct groups based on predictor values (GDP per capita)
\(H_0: \sigma_1^2=...=\sigma_t^2\)
\(H_a:\) not all \(\sigma_i^2\) are equal, \(i=1,...,t\)
\(Test\; Statistics\;(T.S.)\sim F_{t-p,n-t}\)
where t is number of groups, n is number of observations
The \(p-value=\Pr(F_{t-p,n-t}\geq T.S.)\)
library(lawstat)
#Look at plot Graph 1 to divide data into 3 groups
breaks = c(8, 9,10, 12)
groups = cut(Gdp, breaks)
levene.test(re, groups)
##
## Modified robust Brown-Forsythe Levene-type test based on the absolute
## deviations from the median
##
## data: re
## Test Statistic = 0.54928, p-value = 0.581
To apply Levene's Test I divided my data into 3 groups [\(8\leq GDP<9\), \(9\leq GDP<10\), \(10\leq GDP<12\)] based on the predictor GDP. The calculated p-value is 0.581 (58.1%), which exceeds \(\alpha=0.05\) (5%) and indeed any commonly used \(\alpha\) (typically \(1\%\leq\alpha\leq10\%\)). So we fail to reject \(H_0: \sigma_1^2=...=\sigma_t^2\), meaning the variance of the residuals is constant/homogeneous. This matches the conclusion from Graph 5: Residuals vs Fitted values.
Note that the calculated p-value depends on how we divide our data into groups, what interval we used and on the number of intervals/groups.
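That sensitivity is easy to demonstrate. Below is a self-contained sketch (simulated residuals and GDP values standing in for the real data) of the Brown-Forsythe form of Levene's test, which is simply a one-way ANOVA on absolute deviations from the group medians; changing the cut points changes the groups and hence the p-value:

```r
set.seed(1)
e   <- rnorm(50)          # stand-in residuals
gdp <- runif(50, 8, 12)   # stand-in (log) GDP per capita values
bf_p <- function(breaks) {
  g <- cut(gdp, breaks)                  # group by GDP intervals
  d <- abs(e - ave(e, g, FUN = median))  # deviations from group medians
  anova(lm(d ~ g))[["Pr(>F)"]][1]        # Brown-Forsythe p-value
}
bf_p(c(8, 9, 10, 12))   # 3 groups, as in the text
bf_p(c(8, 10, 12))      # 2 groups: generally a different p-value
```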
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:olsrr':
##
## cement
#Compute the studentized deleted residuals
stud.del.res <- studres(my_model_fin)
print(stud.del.res)
## 1 2 3 4 5
## -1.348584716 -0.381198858 0.160782908 -0.146475114 0.488515870
## 6 7 8 9 10
## -0.328987531 -0.090739330 0.395778947 0.291583961 0.282675555
## 11 12 13 14 15
## 0.530326432 0.549005610 0.652611931 0.156666039 0.621176616
## 16 17 18 19 20
## 0.350273584 0.526139413 -0.432639572 1.965599849 -0.268210053
## 21 22 23 24 25
## -0.286285476 -0.433784527 0.513166122 0.016485114 -0.584524003
## 26 27 28 29 30
## -0.529138874 0.219122854 0.293306297 -0.233357910 0.162354143
## 31 32 33 34 35
## -0.008648775 -0.316575228 -0.118426326 0.171064698 0.008750378
## 36 37 38 39 40
## 0.987168299 -78.681573781 -0.368297800 -0.031418401 -0.645481346
## 41 42 43 44 45
## 1.413019328 -0.550967177 0.044865465 0.852585266 0.235144025
## 46 47 48 49 50
## -0.259758149 0.628112241 0.588333880 -0.187605154 0.504442809
#Calculate the Bonferroni critical value for outliers:
alpha = 0.05
t <- qt(1-alpha/(2*length(stud.del.res)),
length(stud.del.res)-9-1)
sprintf("The critical value is t= %s", t)
## [1] "The critical value is t= 3.55096576086335"
paste0("Any studentized residuals > ",t," ? ",
any(abs(stud.del.res) > t))
## [1] "Any studentized residuals > 3.55096576086335 ? TRUE"
By the studentized deleted residuals, the critical t = 3.550966. Comparing the absolute values of the studentized deleted residuals with the critical t, one outlier was found: observation 37, with |studentized deleted residual| = 78.681573781 > critical t = 3.551. So \(Y_{37}\) is an outlier.
diagonal <- lm.influence(my_model_fin)$hat
print("The diagonal elements of the hat matrix")
## [1] "The diagonal elements of the hat matrix"
print(diagonal)
## 1 2 3 4 5 6 7
## 0.37450674 0.18513628 0.09777545 0.19337525 0.12145252 0.17398166 0.13927137
## 8 9 10 11 12 13 14
## 0.12235626 0.06612678 0.12963523 0.07609614 0.18530789 0.17906369 0.16011798
## 15 16 17 18 19 20 21
## 0.19421057 0.14370538 0.23756278 0.12367323 0.25655680 0.13195533 0.09190666
## 22 23 24 25 26 27 28
## 0.31177785 0.07900939 0.05365264 0.15854458 0.18691244 0.13473540 0.06063661
## 29 30 31 32 33 34 35
## 0.18073585 0.16710458 0.19470378 0.07931640 0.16929044 0.16238816 0.22395241
## 36 37 38 39 40 41 42
## 0.16500939 0.28477023 0.08154887 0.09363863 0.25039959 0.34557368 0.08775988
## 43 44 45 46 47 48 49
## 0.34423890 0.28777524 0.21189543 0.39623731 0.15889273 0.35668983 0.22370615
## 50
## 0.16532963
paste0("Any h_ii > ",2*mean(diagonal)," ? ",
any(diagonal > 2*mean(diagonal)))
## [1] "Any h_ii > 0.36 ? TRUE"
By the rule of thumb, any leverage \(h_{ii}\) > 2\(\bar{h}\) is flagged as having "high" leverage. In my project 2\(\bar{h}\)=0.36. By this rule there are 2 high-leverage points: observation 1 (\(h_{1,1}\)=0.37450674) and observation 46 (\(h_{46,46}\)=0.39623731), both greater than 2\(\bar{h}\)=0.36. For predictions near such points, my model may be extrapolating far outside the general region of my data.
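The flagged cases can be found programmatically with which() instead of being read off by eye; a minimal sketch with illustrative leverage values (not the real ones; for the fitted model one would test lm.influence(my_model_fin)$hat the same way):

```r
h <- c(0.37, 0.12, 0.45, 0.08, 0.15, 0.10, 0.18)  # illustrative h_ii values
threshold <- 2 * mean(h)      # rule-of-thumb cutoff 2*h-bar
which(h > threshold)          # indices flagged as high leverage
```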
#Calculate Cook's distance D_i for each case and prepare an index plot.
#Are any cases influential according to this measure?
table <- data.frame(My_Data$`Ladder score`,cooks.distance(my_model_fin),pf(cooks.distance(my_model_fin),9,length(My_Data$`Ladder score`)-9))
colnames(table) = c("Y", "Cooks distance", "F percentile")
head(table)
## Y Cooks distance F percentile
## 1 7.842 0.1186217970 9.650723e-04
## 2 7.620 0.0037464181 2.828066e-10
## 3 7.571 0.0003188568 4.396187e-15
## 4 7.554 0.0005854712 6.763173e-14
## 5 7.464 0.0037350515 2.789801e-10
## 6 7.392 0.0025892886 5.392405e-11
index=1:length(My_Data$`Ladder score`)
plot(index,cooks.distance(my_model_fin), type="l", col="red", lwd=3, xlab="Case index", ylab="Cook's distance", main="Index plot")
paste0("Any F percentile > ",0.5," ? ",
any(pf(cooks.distance(my_model_fin),9,length(My_Data$`Ladder score`)-9) > 0.5))
## [1] "Any F percentile > 0.5 ? TRUE"
To detect a high Cook's distance we compute its percentile in the F(p, n-p) distribution, where n is the sample size and p is the number of \(\beta\)'s, and compare it with 0.5: if the percentile exceeds 0.5, the Cook's distance is considered high.
By the table of Cook's distances and the index plot there is one influential case, case 37: its Cook's distance is 1.802160 and its F percentile = 0.9027 > 0.5. So case 37 has a disproportionate influence on the fitted regression surface as a whole.
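This rule can be applied mechanically in R; a self-contained sketch on the built-in mtcars data (standing in for the happiness data):

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)
d   <- cooks.distance(fit)
p   <- length(coef(fit))        # number of betas (here 3)
n   <- nrow(mtcars)
pf(d, p, n - p)                 # F percentile of each Cook's distance
which(pf(d, p, n - p) > 0.5)    # cases flagged by the 50% rule
```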
table2 <- data.frame(My_Data$`Ladder score`,dffits(my_model_fin))
colnames(table2) = c("Y", "DFFIT")
head(table2)
## Y DFFIT
## 1 7.842 -1.04351030
## 2 7.620 -0.18169998
## 3 7.571 0.05292946
## 4 7.554 -0.07171806
## 5 7.464 0.18163491
## 6 7.392 -0.15098588
In Table 2 of DFFITS we can see that observation 37 has a very large absolute DFFITS value, 49.647517358, relative to the other observations, whose absolute DFFITS lie in [0; 1.1]. Since its |DFFITS| > 1, observation 37 is an outlier.
By computing studentized deleted residuals, leverages, Cook's distances, and DFFITS, 3 outlying/influential cases were detected: case 37, case 1, and case 46. They may affect the fitted regression function more than other points. If the outlying points follow the modeling assumptions and are representative, they may strengthen inference and reduce prediction error; if not, they may badly skew inference and yield models with poor predictive properties.
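Base R can also compute all of these diagnostics at once with influence.measures(), which flags cases by DFBETAS, DFFITS, covariance ratio, Cook's distance, and leverage in a single call; a sketch on mtcars:

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)
im  <- influence.measures(fit)   # DFBETAS, DFFITS, cov ratio, Cook's D, hat
summary(im)                      # prints only the cases flagged as influential
which(apply(im$is.inf, 1, any))  # row indices flagged by at least one measure
```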
library(car)
vif(my_model_fin)
## Gdp socsup lifeexp freed generos
## 3.178866 3.076378 3.159432 49.048744 1.786673
## corrupt dystopia freed:corrupt
## 373.473056 1.551261 274.017663
#mod=update(my_model_fin,. ~ . -freed*corrupt+freed+corrupt)
#vif(mod)
By calculating VIFs we can see that corruption, freedom, and \(freedom\times corruption\) are multicollinear, as their VIFs are > 10. I expected them to be involved in severe multicollinearity, since \(freedom\times corruption\) is an interaction term between corruption and freedom.
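A standard remedy for interaction-induced multicollinearity (not used in this report, which instead drops the interaction later) is to center the two predictors before forming their product. A self-contained sketch with simulated data and a hand-rolled VIF, so no extra package is needed:

```r
set.seed(2)
x1 <- runif(200); x2 <- runif(200)
y  <- 1 + x1 + x2 + 0.5 * x1 * x2 + rnorm(200, sd = 0.1)
vif_last <- function(X) {        # VIF of the last column of X
  r2 <- summary(lm(X[, ncol(X)] ~ X[, -ncol(X)]))$r.squared
  1 / (1 - r2)
}
x1c <- x1 - mean(x1); x2c <- x2 - mean(x2)
vif_last(cbind(x1, x2, x1 * x2))      # raw interaction: large VIF
vif_last(cbind(x1c, x2c, x1c * x2c))  # centered interaction: VIF near 1
```

Centering leaves the fitted interaction effect unchanged but makes the product term nearly orthogonal to its components.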
From Graph 5 (Residuals \(e_i\) vs Fitted values \(\hat Y_i\)) I concluded that there may be a parabolic relationship between Y and \(x_1\), \(x_2\), ..., \(x_8\), so a transformation of the predictors may be needed to fix this violation. Added-variable plots help reveal which transformation should be applied to which predictor.
My_Data$corruptfreed=My_Data$`Perceptions of corruption`*My_Data$`Freedom to make life choices`
#Added var plot for x1
fit11 = lm(hapscore~ socsup+lifeexp+freed+generos+corrupt+dystopia+corruptfreed, data=My_Data)
reYonithoutx1=fit11$residuals
fit12 = lm(Gdp ~ socsup+lifeexp+freed+generos+corrupt+dystopia+corruptfreed, data=My_Data)
rex1onwithoutx1=fit12$residuals
#plot(reYonithoutx1 ~ rex1onwithoutx1, main="Graph 1: Added-Variable Plot for x1 (GDP)", xlab="e(x1|others)", ylab="e(Y|Y on without x1) ")
scatter.smooth(x=rex1onwithoutx1, y=reYonithoutx1, main="Graph 1: Added-Variable Plot for x1 (GDP)",xlab="e(x1|others)", ylab="e(Y|others without x1)")
#Added var plot for x2
fit21 = lm(hapscore~ Gdp+lifeexp+freed+generos+corrupt+dystopia+corruptfreed, data=My_Data)
reYonithoutx2=fit21$residuals
fit22 = lm(socsup~ Gdp+lifeexp+freed+generos+corrupt+dystopia+corruptfreed, data=My_Data)
rex2onwithoutx2=fit22$residuals
#plot(reYonithoutx2 ~ rex2onwithoutx2, main="Graph 2: Added-Variable Plot for x2 (social support)", xlab="e(x2|others)", ylab="e(Y|others without x2) ")
scatter.smooth(x=rex2onwithoutx2, y=reYonithoutx2, main="Graph 2: Added-Variable Plot for x2 (social support)", xlab="e(x2|others)", ylab="e(Y|others without x2) ")
#Added var plot for x3
fit31 = lm(hapscore~ Gdp+socsup+freed+generos+corrupt+dystopia+corruptfreed, data=My_Data)
reYonithoutx3=fit31$residuals
fit32 = lm(lifeexp~ Gdp+socsup+freed+generos+corrupt+dystopia+corruptfreed, data=My_Data)
rex3onwithoutx3=fit32$residuals
#plot(reYonithoutx3 ~ rex3onwithoutx3, main="Graph 3: Added-Variable Plot for x3 (life expectancy)", xlab="e(x3|others)", ylab="e(Y|others without x3) ")
scatter.smooth(x=rex3onwithoutx3, y=reYonithoutx3, main="Graph 3: Added-Variable Plot for x3 (life expectancy)", xlab="e(x3|others)", ylab="e(Y|others without x3) ")
#Added var plot for x4
fit41 = lm(hapscore~ Gdp+socsup+lifeexp+generos+corrupt+dystopia+corruptfreed, data=My_Data)
reYonithoutx4=fit41$residuals
fit42 = lm(freed~ Gdp+socsup+lifeexp+generos+corrupt+dystopia+corruptfreed, data=My_Data)
rex4onwithoutx4=fit42$residuals
#plot(reYonithoutx4 ~ rex4onwithoutx4, main="Graph 4: Added-Variable Plot for x4 (freedom to make life choice)", xlab="e(x4|others)", ylab="e(Y|others without x4) ")
scatter.smooth(x=rex4onwithoutx4, y=reYonithoutx4, main="Graph 4: Added-Variable Plot for x4 (freedom to make life choice)", xlab="e(x4|others)", ylab="e(Y|others without x4) ")
#Added var plot for x5
fit51 = lm(hapscore~ Gdp+socsup+lifeexp+freed+corrupt+dystopia+corruptfreed, data=My_Data)
reYonithoutx5=fit51$residuals
fit52 = lm(generos~ Gdp+socsup+lifeexp+freed+corrupt+dystopia+corruptfreed, data=My_Data)
rex5onwithoutx5=fit52$residuals
#plot(reYonithoutx5 ~ rex5onwithoutx5, main="Graph 5: Added-Variable Plot for x5 (generosity)", xlab="e(x5|others)", ylab="e(Y|others without x5) ")
scatter.smooth(x=rex5onwithoutx5, y=reYonithoutx5, main="Graph 5: Added-Variable Plot for x5 (generosity)", xlab="e(x5|others)", ylab="e(Y|others without x5) ")
#Added var plot for x6
fit61 = lm(hapscore~ Gdp+socsup+lifeexp+freed+generos+dystopia+corruptfreed, data=My_Data)
reYonithoutx6=fit61$residuals
fit62 = lm(corrupt~ Gdp+socsup+lifeexp+freed+generos+dystopia+corruptfreed, data=My_Data)
rex6onwithoutx6=fit62$residuals
#plot(reYonithoutx6 ~ rex6onwithoutx6, main="Graph 6: Added-Variable Plot for x6 (Perceptions of corruption)", xlab="e(x6|others)", ylab="e(Y|others without x6) ")
scatter.smooth(x=rex6onwithoutx6, y=reYonithoutx6, main="Graph 6: Added-Variable Plot for x6 (Perceptions of corruption)", xlab="e(x6|others)", ylab="e(Y|others without x6) ")
#Added var plot for x7
fit71 = lm(hapscore~ Gdp+socsup+lifeexp+freed+generos+corrupt+corruptfreed, data=My_Data)
reYonithoutx7=fit71$residuals
fit72 = lm(dystopia~ Gdp+socsup+lifeexp+freed+generos+corrupt+corruptfreed, data=My_Data)
rex7onwithoutx7=fit72$residuals
#plot(reYonithoutx7 ~ rex7onwithoutx7, main="Graph 7: Added-Variable Plot for x7 (Ladder score in Dystopia)", xlab="e(x7|others)", ylab="e(Y|others without x7) ")
scatter.smooth(x=rex7onwithoutx7, y=reYonithoutx7, main="Graph 7: Added-Variable Plot for x7 (Ladder score in Dystopia)", xlab="e(x7|others)", ylab="e(Y|others without x7) ")
#Added var plot for x8
fit81 = lm(hapscore~ Gdp+socsup+lifeexp+freed+generos+corrupt+dystopia, data=My_Data)
reYonithoutx8=fit81$residuals
fit82 = lm(corruptfreed~ Gdp+socsup+lifeexp+freed+generos+corrupt+dystopia, data=My_Data)
rex8onwithoutx8=fit82$residuals
#plot(reYonithoutx8 ~ rex8onwithoutx8, main="Graph 8: Added-Variable Plot for x8 (corruption*freedom)", xlab="e(x8|others)", ylab="e(Y|others without x8) ")
scatter.smooth(x=rex8onwithoutx8, y=reYonithoutx8, main="Graph 8: Added-Variable Plot for x8 (corruption*freedom)", xlab="e(x8|others)", ylab="e(Y|others without x8) ")
From the 7 added-variable plots for GDP, social support, life expectancy, freedom to make life choices, generosity, ladder score in Dystopia, and perceptions of corruption, we can see a strong linear relationship between residuals(\(Y\)|\(x_{-j}\)) and residuals(\(x_{j}\)|\(x_{-j}\)). So each \(x_j\) explains residual variability once the rest of the predictors are in the model; and since the relation is linear, no transformation of any \(x_j\) is needed.
However, in Graph 8 (Added-Variable Plot for \(x_8\), corruption*freedom) the points follow a horizontal line, which means that \(x_8\) does not explain any residual variability once the rest of the predictors are in the model.
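As a sanity check on these plots: by the Frisch-Waugh-Lovell theorem, the slope of the least-squares line fitted through an added-variable plot equals that predictor's coefficient in the full model. A self-contained sketch on mtcars:

```r
full <- lm(mpg ~ wt + hp + disp, data = mtcars)
ry <- resid(lm(mpg ~ hp + disp, data = mtcars))  # e(Y | others)
rx <- resid(lm(wt  ~ hp + disp, data = mtcars))  # e(wt | others)
coef(lm(ry ~ rx))[2]    # identical to coef(full)["wt"]
```

So the nearly flat line in Graph 8 is just the graphical counterpart of \(x_8\)'s small partial contribution.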
Remedial measures are some procedures that we can apply to fix problems of our Regression model such as Nonlinear Relation, Non-Constant Variance, Non-Independence of Errors, Non-Normality of Errors, Outliers.
According to the results of the tests and analysis of the plots, my Regression model holds only 2 basic assumptions of MLR such as homogeneity of variance of residuals, independence of residuals.
By making diagnostics, I found out that 3 basic assumptions were violated. Residuals of the model have non-normal distribution. There is non-linear relationship between Y and \(x_1\),\(x_2\),…,\(x_8\). Also, 3 predictors (corruption, freedom, \(freedom\times corruption\)) are involved into severe multicollinearity.
Added-variable plots for each predictor show that \(x_8\) (\(corruption\times freedom\)) does not explain any residual variability once the rest of the predictors are in the model.
By computing studentized deleted residuals, leverages, Cook's distances, and DFFITS, 3 outliers were detected: case 37, case 1, and case 46. Every method except leverage flagged case 37 as an outlier.
To fix these problems I decided to remove the \(x_8\) (\(corruption\times freedom\)) predictor and observation 37 from the model.
I created a new sheet in my Excel table, manually deleted observation 37 from it, and refit the model without the \(x_8\) (\(corruption\times freedom\)) predictor.
require(foreign)
## Loading required package: foreign
require(MASS)
library("readxl")
library("plyr")
new_Data <- read_excel("Data2.xlsx",sheet=2,range="B2:K51", col_names = TRUE)
hapscore_new<-new_Data$`Ladder score`
Gdp_new<-new_Data$`GDP per capita`
socsup_new<-new_Data$`Social support`
lifeexp_new<-new_Data$`Healthy life expectancy`
freed_new<-new_Data$`Freedom to make life choices`
generos_new<-new_Data$`Generosity`
corrupt_new<-new_Data$`Perceptions of corruption`
dystopia_new<-new_Data$`Ladder score in Dystopia +residual`
irls= lm(hapscore_new~ Gdp_new+socsup_new+lifeexp_new+freed_new+generos_new+corrupt_new+dystopia_new, data=new_Data)
summary(irls)
##
## Call:
## lm(formula = hapscore_new ~ Gdp_new + socsup_new + lifeexp_new +
## freed_new + generos_new + corrupt_new + dystopia_new, data = new_Data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0074348 -0.0008362 0.0002282 0.0009887 0.0042172
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.556e+00 7.508e-03 -606.8 <2e-16 ***
## Gdp_new 3.498e-01 5.120e-04 683.1 <2e-16 ***
## socsup_new 2.239e+00 5.983e-03 374.2 <2e-16 ***
## lifeexp_new 3.148e-02 9.204e-05 342.0 <2e-16 ***
## freed_new 1.220e+00 4.272e-03 285.6 <2e-16 ***
## generos_new 6.574e-01 2.766e-03 237.7 <2e-16 ***
## corrupt_new -6.372e-01 1.703e-03 -374.1 <2e-16 ***
## dystopia_new 9.991e-01 9.521e-04 1049.4 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.001828 on 41 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.168e+06 on 7 and 41 DF, p-value: < 2.2e-16
By the t-tests we reject \(H_0: \beta_{i}=0\) for every predictor in the new model, since each p-value < \(2\times10^{-16}\) < 0.05.
By the F-test, which tests whether all \(\beta_{i}=0\), we can reject \(H_0: \beta_{1}=\beta_{2}=...=\beta_{7}=0\), as p-value < \(2\times10^{-16}\) < \(\alpha=0.05\). So at least one coefficient is nonzero. The F-test for the full model and the t-tests for each predictor in the presence of the other predictors thus reach the same conclusion: there are nonzero \(\beta\)'s.
Also, adjusted \(R^2=1\) (to the displayed precision), so the model fits the data almost perfectly; its MSE is \(0.001828^2\approx3.3\times10^{-6}\), essentially 0. Such a near-perfect fit is not surprising here, since the report's ladder score is constructed approximately as the sum of these component contributions.
So \(\hat{Y}_{i} = -4.556+0.349*X_{i,1}+ 2.239*X_{i,2}+0.031*{X}_{i,3} +1.22*{X}_{i,4}+0.657*{X}_{i,5} -0.637*{X}_{i,6}+ 0.999*{X}_{i,7}\).
After changing my model I will run the diagnostics again to check whether the new model satisfies the basic assumptions of MLR.
# 1) Independence of residuals
library(lawstat)
library(latex2exp)
re_new = irls$residuals
#a) plot: e_i vs i
plot(re_new, main="Graph 9: Time Series Plot of the Residuals", xlab=TeX("i"), ylab=TeX("e_i"))
lines(re_new)
abline(h=0)
#b) Runs Test for Independence
library(lawstat)
runs.test(re_new, plot.it=TRUE)
##
## Runs Test - Two sided
##
## data: re_new
## Standardized Runs Statistic = 0.43624, p-value = 0.6627
#4.2) Checking normality of residuals
#a) Normal Q-Q Plot
qqnorm(re_new,ylim=c(-0.01,0.01))
qqline(re_new)
#b) Histogram of residuals
hist(re_new, xlab='residuals', main='Graph 10: Histogram of residuals', freq=FALSE)
curve(dnorm(x, mean(re_new), sd(re_new)), add=TRUE)
#c) Shapiro-Wilk Test for Normality
shapiro.test(re_new)
##
## Shapiro-Wilk normality test
##
## data: re_new
## W = 0.8808, p-value = 0.0001374
#Checking Homogeneity of Variance and linearity of $\hat{Y}_{i}$
fits = irls$fitted.values
plot(re_new ~ fits, ylim=c(-0.01,0.01), main="Graph 11: Residuals vs Fitted values", xlab=TeX("\\hat{Y}_i "), ylab=TeX("e_i "))
abline(h=0)
#b) Levene’s Test for Homogeneity of Variance
library(lawstat)
#Look at plot Graph 1 to divide data into 3 groups
breaks = c(8, 9,10, 12)
groups = cut(Gdp_new, breaks)
levene.test(re_new, groups)
##
## Modified robust Brown-Forsythe Levene-type test based on the absolute
## deviations from the median
##
## data: re_new
## Test Statistic = 0.8438, p-value = 0.4366
1) Independence of residuals
According to Graph 9: Time Series Plot of the Residuals, the residuals are independent: there is no repeated pattern, and I do not see a clear relationship between \(e_i\) and \(i\). No \(e_j\) can be predicted from the previous residuals, so there is no evidence of dependence.
Also, the Runs Test for Independence failed to reject \(H_0:\; e_i\; are\; independent,\; i=1,2,...,n\), since the p-value of the two-sided Runs Test, 0.6627 (66.27%), exceeds \(\frac{\alpha}{2}=0.025\).
Indeed, a p-value of 66.27% exceeds any commonly used \(\frac{\alpha}{2}\): typically \(1\% \leq\alpha\leq10\%\), so \(0.5\% \leq\frac{\alpha}{2}\leq5\%\). Hence we retain \(H_0:\; e_i\; are\; independent,\; i=1,2,...,n\). Graph 9 and the Runs Test for Independence therefore lead to the same conclusion: the residuals are independent.
2) Normality of residuals
From the Normal Q-Q plot it is difficult to tell whether the residuals are normally distributed. Residuals with theoretical quantiles between -1.5 and 1.5 lie on or very close to the reference line, so they look normal, whereas residuals with theoretical quantiles beyond \(\pm1.5\) depart increasingly far from the line.
According to Graph 10: Histogram of Residuals, the residuals look approximately normally distributed, as the histogram roughly follows the shape of a normal curve; the distribution is slightly left-skewed, with a longer left tail.
So I conducted the Shapiro-Wilk test to decide formally whether the residuals are normally distributed. By the Shapiro-Wilk Test for Normality, W = 0.8808 and p-value = 0.0001374, approximately 0.01%. At significance level \(\alpha=0.05\) (5%), p-value = 0.0001374 < \(\alpha\), so I reject \(H_0: e_1,e_2,...,e_n\sim N\) in favor of \(H_a\). The residuals are still not normally distributed.
3) Homogeneity of Variance and linearity of \(\hat{Y}_{i}\)
Firstly, according to Graph 11: Residuals vs Fitted values, there is no clear pattern between \(e_i\) and \(\hat Y_i\): the residuals are evenly spread on either side of the 0 line, so the relationship between Y and \(x_1\),...,\(x_7\) is linear. Secondly, the residuals neither fan out nor contract as \(\hat Y_i\) increases, so the variance of the residuals is homogeneous.
To apply Levene's Test I divided my data into 3 groups [\(8\leq GDP<9\), \(9\leq GDP<10\), \(10\leq GDP<12\)] based on the predictor GDP. The calculated p-value is 0.4366 (43.66%), which exceeds \(\alpha=0.05\) (5%) and indeed any commonly used \(\alpha\). So we fail to reject \(H_0: \sigma_1^2=...=\sigma_t^2\), meaning the variance of the residuals is constant/homogeneous. This matches the conclusion from Graph 11.
Studentized deleted residuals
library(MASS)
#Compute the studentized deleted residuals
stud.del.res <- studres(irls)
print(stud.del.res)
## 1 2 3 4 5 6
## 1.37853191 1.44257055 -0.22329394 0.41614910 -0.42731904 -1.29208499
## 7 8 9 10 11 12
## -0.30308858 -0.48334884 0.32019755 0.30333339 0.30650499 -0.21404107
## 13 14 15 16 17 18
## 0.49038819 -5.58297177 0.80430395 -1.03439399 0.29997286 -0.39757212
## 19 20 21 22 23 24
## 3.00933372 1.16524367 0.76054440 0.39172197 -0.39633936 0.60236250
## 25 26 27 28 29 30
## 0.15849904 -0.38297904 -0.56620787 -1.20048595 -0.51007232 1.67819486
## 31 32 33 34 35 36
## -0.88149730 0.12836370 0.67187034 -0.50111583 0.09714979 0.16210621
## 37 38 39 40 41 42
## -0.02608556 -0.19853905 0.74944596 -0.77263544 -0.96330073 -0.59465185
## 43 44 45 46 47 48
## 0.68894665 -0.43254825 -1.08390061 0.46971753 0.20196557 0.60395332
## 49
## 0.66009937
#Calculate the Bonferroni critical value for outliers:
alpha = 0.05
t <- qt(1-alpha/(2*length(stud.del.res)),
length(stud.del.res)-8-1)
sprintf("The critical value is t= %s", t)
## [1] "The critical value is t= 3.54395250327835"
paste0("Any studentized residuals > ",t," ? ",
any(abs(stud.del.res) > t))
## [1] "Any studentized residuals > 3.54395250327835 ? TRUE"
diagonal <- lm.influence(irls)$hat
print("The diagonal elements of the hat matrix")
## [1] "The diagonal elements of the hat matrix"
print(diagonal)
## 1 2 3 4 5 6 7
## 0.34660147 0.14322717 0.09759096 0.18386491 0.11300372 0.09791491 0.10487565
## 8 9 10 11 12 13 14
## 0.12152539 0.05680724 0.12910794 0.07859997 0.14412735 0.09769284 0.07927742
## 15 16 17 18 19 20 21
## 0.16549908 0.14440774 0.24189521 0.12296642 0.29722135 0.13474573 0.09301971
## 22 23 24 25 26 27 28
## 0.30423764 0.08293250 0.05192022 0.15803448 0.16819974 0.13601217 0.06304312
## 29 30 31 32 33 34 35
## 0.18083375 0.16694097 0.16035696 0.07707504 0.16197947 0.16114961 0.11699467
## 36 37 38 39 40 41 42
## 0.15853531 0.06362119 0.08849877 0.24360833 0.32887598 0.09246225 0.33568260
## 43 44 45 46 47 48 49
## 0.28239961 0.20506457 0.39659426 0.15585249 0.28654103 0.21085848 0.16772259
paste0("Any h_ii > ",2*mean(diagonal)," ? ",
any(diagonal > 2*mean(diagonal)))
## [1] "Any h_ii > 0.326530612244898 ? TRUE"
table3 <- data.frame(new_Data$`Ladder score`,cooks.distance(irls),pf(cooks.distance(irls),8,length(new_Data$`Ladder score`)-8))
colnames(table3) = c("Y", "Cooks distance", "F percentile")
head(table3)
## Y Cooks distance F percentile
## 1 7.842 0.123299724 2.049038e-03
## 2 7.620 0.042368352 3.861277e-05
## 3 7.571 0.000690006 3.182157e-12
## 4 7.554 0.004977283 8.475499e-09
## 5 7.464 0.002967094 1.078587e-09
## 6 7.392 0.022287390 3.190128e-06
index=1:length(new_Data$`Ladder score`)
plot(index,cooks.distance(irls), type="l", col="red", lwd=3, xlab="Case index", ylab="Cook's distance", main="Index plot")
paste0("Any F percentile > ",0.5," ? ",
any(pf(cooks.distance(irls),8,length(new_Data$`Ladder score`)-8) > 0.5))
## [1] "Any F percentile > 0.5 ? FALSE"
table4 <- data.frame(new_Data$`Ladder score`,dffits(irls))
colnames(table4) = c("Y", "DFFIT")
head(table4)
## Y DFFIT
## 1 7.842 1.0040213
## 2 7.620 0.5898165
## 3 7.571 -0.0734311
## 4 7.554 0.1975229
## 5 7.464 -0.1525238
## 6 7.392 -0.4256883
1) Studentized deleted residuals
By the studentized deleted residuals, the critical t = 3.5439. Comparing the absolute values of the studentized deleted residuals with the critical t, one outlier was found: observation 14, with |studentized deleted residual| = 5.58297177 > critical t = 3.5439. So \(Y_{14}\) is an outlier.
2) Leverage
By the rule of thumb, any leverage \(h_{ii}\) > 2\(\bar{h}\) is flagged as having "high" leverage. In my project 2\(\bar{h}\)=0.32653. By this rule there are 3 high-leverage points: observation 1 (\(h_{1,1}\)=0.34660147), observation 40 (\(h_{40,40}\)=0.32887598), and observation 42 (\(h_{42,42}\)=0.33568260), all greater than 2\(\bar{h}\)=0.32653. For predictions near such points, my model may be extrapolating far outside the general region of my data.
3) Cook’s distance
To detect a high Cook's distance we compute its percentile in the F(p, n-p) distribution, where n is the sample size and p is the number of \(\beta\)'s; if the percentile exceeds 0.5, the Cook's distance is considered high.
By the index plot, one observation, observation 19, has a high Cook's distance (0.4001318) relative to the others. However, I do not consider it influential, since by the table of Cook's distances its F percentile = 0.08600397 < 0.5.
4) DFFITs
In Table 4 of DFFITS we can see that observation 14 (DFFITS = -1.638233500) and observation 19 (DFFITS = 1.957046176) have very large absolute DFFITS values relative to the other observations, whose absolute DFFITS lie in [0; 1]. Since |DFFITS| > 1 for both, they are outliers.
library(car)
vif(irls)
## Gdp_new socsup_new lifeexp_new freed_new generos_new corrupt_new
## 3.204327 3.237903 3.283171 1.931430 1.866073 1.939538
## dystopia_new
## 1.522190
By calculating VIFs we can see that every predictor's VIF lies in the range [1.5; 3.3], so every VIF < 10. No predictor is involved in severe multicollinearity anymore.
#Added-variable plot for x1 (GDP)
fit11 = lm(hapscore ~ socsup+lifeexp+freed+generos+corrupt+dystopia, data=My_Data)
reYwithoutx1 = fit11$residuals
fit12 = lm(Gdp ~ socsup+lifeexp+freed+generos+corrupt+dystopia, data=My_Data)
rex1withoutx1 = fit12$residuals
scatter.smooth(x=rex1withoutx1, y=reYwithoutx1, main="Graph 1: Added-Variable Plot for x1 (GDP)", xlab="e(x1|others)", ylab="e(Y|others without x1)")
#Added-variable plot for x2 (social support)
fit21 = lm(hapscore ~ Gdp+lifeexp+freed+generos+corrupt+dystopia, data=My_Data)
reYwithoutx2 = fit21$residuals
fit22 = lm(socsup ~ Gdp+lifeexp+freed+generos+corrupt+dystopia, data=My_Data)
rex2withoutx2 = fit22$residuals
scatter.smooth(x=rex2withoutx2, y=reYwithoutx2, main="Graph 2: Added-Variable Plot for x2 (social support)", xlab="e(x2|others)", ylab="e(Y|others without x2)")
#Added-variable plot for x3 (life expectancy)
fit31 = lm(hapscore ~ Gdp+socsup+freed+generos+corrupt+dystopia, data=My_Data)
reYwithoutx3 = fit31$residuals
fit32 = lm(lifeexp ~ Gdp+socsup+freed+generos+corrupt+dystopia, data=My_Data)
rex3withoutx3 = fit32$residuals
scatter.smooth(x=rex3withoutx3, y=reYwithoutx3, main="Graph 3: Added-Variable Plot for x3 (life expectancy)", xlab="e(x3|others)", ylab="e(Y|others without x3)")
#Added-variable plot for x4 (freedom to make life choices)
fit41 = lm(hapscore ~ Gdp+socsup+lifeexp+generos+corrupt+dystopia, data=My_Data)
reYwithoutx4 = fit41$residuals
fit42 = lm(freed ~ Gdp+socsup+lifeexp+generos+corrupt+dystopia, data=My_Data)
rex4withoutx4 = fit42$residuals
scatter.smooth(x=rex4withoutx4, y=reYwithoutx4, main="Graph 4: Added-Variable Plot for x4 (freedom to make life choices)", xlab="e(x4|others)", ylab="e(Y|others without x4)")
#Added-variable plot for x5 (generosity)
fit51 = lm(hapscore ~ Gdp+socsup+lifeexp+freed+corrupt+dystopia, data=My_Data)
reYwithoutx5 = fit51$residuals
fit52 = lm(generos ~ Gdp+socsup+lifeexp+freed+corrupt+dystopia, data=My_Data)
rex5withoutx5 = fit52$residuals
scatter.smooth(x=rex5withoutx5, y=reYwithoutx5, main="Graph 5: Added-Variable Plot for x5 (generosity)", xlab="e(x5|others)", ylab="e(Y|others without x5)")
#Added-variable plot for x6 (perceptions of corruption)
fit61 = lm(hapscore ~ Gdp+socsup+lifeexp+freed+generos+dystopia, data=My_Data)
reYwithoutx6 = fit61$residuals
fit62 = lm(corrupt ~ Gdp+socsup+lifeexp+freed+generos+dystopia, data=My_Data)
rex6withoutx6 = fit62$residuals
scatter.smooth(x=rex6withoutx6, y=reYwithoutx6, main="Graph 6: Added-Variable Plot for x6 (perceptions of corruption)", xlab="e(x6|others)", ylab="e(Y|others without x6)")
#Added-variable plot for x7 (ladder score in Dystopia)
fit71 = lm(hapscore ~ Gdp+socsup+lifeexp+freed+generos+corrupt, data=My_Data)
reYwithoutx7 = fit71$residuals
fit72 = lm(dystopia ~ Gdp+socsup+lifeexp+freed+generos+corrupt, data=My_Data)
rex7withoutx7 = fit72$residuals
scatter.smooth(x=rex7withoutx7, y=reYwithoutx7, main="Graph 7: Added-Variable Plot for x7 (ladder score in Dystopia)", xlab="e(x7|others)", ylab="e(Y|others without x7)")
All 7 added-variable plots show a strong linear relationship between residuals(\(Y\)|\(x_{-j}\)) and residuals(\(x_{j}\)|\(x_{-j}\)). So each \(x_j\) explains residual variability once the rest of the predictors are in the model, and I need to keep all 7 predictors. Also, since each relationship is linear, I do not need to apply any transformation to any \(x_j\).
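The seven manual plots above can also be produced in a single call with `avPlots()` from the `car` package (already loaded for `vif()`); a sketch, assuming the full fitted model object `irls`:

```r
# Draw an added-variable plot for every predictor in the model at once.
library(car)
avPlots(irls)
```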
I regressed Y (happiness score) on my 8 predictors (region, GDP, corruption, freedom to make life choices, social support, healthy life expectancy, generosity, ladder score in Dystopia). After conducting the appropriate tests and plotting the relevant graphs, I concluded that the categorical variable, Regional indicator, should be dropped: by the F-test it is not necessary, and it has little influence on the fitted values.
Then I performed model selection using stepwise regression and model selection criteria such as adjusted \(R^2\), \(AIC\), and \(C_p\). This yielded a new model based on 8 predictors (GDP, corruption, freedom to make life choices, social support, healthy life expectancy, generosity, ladder score in Dystopia, and \(freedom\times corruption\)).
After applying the methods studied in the lectures, I found that the new regression model satisfies the homogeneity-of-variance, normality, and independence assumptions. However, it violates linearity between Y and \(x_1,...,x_8\), suffers from multicollinearity, and has 3 outliers. The most influential observation was observation 37. By calculating VIFs and constructing added-variable plots, I concluded that the interaction term \(freedom\times corruption\) was involved in severe multicollinearity and did not explain any residual variability.
As a remedial measure, I dropped the interaction term \(freedom\times corruption\) from the model and removed observation 37.
I then ran the diagnostics again. The new model violates only the normality-of-residuals assumption and has 4 outliers (observations 1, 14, 40, and 42). However, these outliers do not harm the regression model: adjusted \(R^2\approx 1\), which suggests the model will predict data not used to build it very well, and the MSE is close to 0. Also, by the F-test and t-tests the presence of each predictor is required, and the added-variable plots showed that each predictor explains residual variability in the presence of the other predictors.
Moreover, I can ignore the non-normality of residuals because the sample size (n = 50) is large enough: by the Central Limit Theorem, the sampling distribution of the estimates converges to a normal distribution as N increases, provided the residuals are independent and identically distributed and their variance is homogeneous.
So my remedial measure was effective, and I successfully found a model that satisfies all the basic assumptions of MLR.
I would like to mention some suggestions that can improve my analysis:
1. Use a larger sample; try to find more observations.
The larger the sample, the more precise the results and the more accurate the regression model. Also, if the sample size is large enough, non-normality of residuals can be ignored.
2. Use other predictor variables instead of social support, freedom to make life choices, generosity, and corruption perception.
All 4 of these predictors are national averages of binary (0 or 1) responses to GWP questions answered “Yes” or “No”, collected by surveying residents. However, residents’ opinions can be biased. For example, residents who have never traveled to other countries and have not seen the quality of life there may judge the quality of life in their own country to be bad. So I would replace these variables with recognized international indexes: for example, instead of the corruption perception measure from the World Happiness Report 2021, I would use the Corruption Perceptions Index (CPI). Generosity could be replaced by the annual amount of donations made by the residents of a country.
Also, these variables are categorical and nominal. If I were one of the authors of the World Happiness Report 2021, I would make them ordinal. Instead of GWP questions with only two answers (“Yes” or “No”), I would ask questions such as: on a scale from 1 to 10, please assess your freedom to make life choices, where 1 means you cannot make life decisions and 10 means you have absolute freedom to make life decisions yourself.
3. Add new predictor variables to the model.
The response variable Y in my project was the happiness score (ladder score), which is the national average response to a life-evaluation question: the experts ask residents to assess their own life in their country on a scale from 0 (the worst possible life) to 10 (the best possible life). I would add new predictors such as HDI, literacy rate, unemployment rate, poverty rate, and the Gini index to my model, since all of these economic indexes relate directly to the assessment of quality of life in a country. For example, the Gini index measures income inequality within a country. Adding these predictors could improve the fit of the model and its predictive performance.
1. The data for the analysis was taken from World Happiness Report 2021: https://worldhappiness.report/ed/2021/.
2. The analysis is based on the lecture slides of professor Zhenisbek Assylbekov and on Chapter 6, Chapter 7, Chapter 8, Chapter 9, Chapter 10 and Chapter 11 of Kutner’s “Applied Linear Statistical Models” textbook.